Refactor of beam search to process factor groups in parallel #772

rhenry-nv · 2020-12-08T04:58:59Z

Description

This PR refactors the beam search to process the secondary factors in parallel.

As is, this work significantly reduces the host-to-device (H2D) communication required when processing the secondary factors in a model with a factored vocabulary.
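To illustrate why batching can cut H2D traffic, here is a minimal sketch, not the code in this PR. It assumes each factor group produces its own list of indices on the host, and packs those lists into one contiguous buffer with offsets so that a single upload can replace one upload per group. The names `PackedIndices` and `packGroups` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container: all per-group index lists packed back to back.
struct PackedIndices {
  std::vector<int> flat;             // concatenated indices for every group
  std::vector<std::size_t> offsets;  // group g occupies [offsets[g], offsets[g + 1])
};

// Pack one index list per factor group into a single contiguous buffer.
// With this layout, one host-to-device copy of `flat` (plus one of `offsets`)
// can stand in for a separate copy per group.
PackedIndices packGroups(const std::vector<std::vector<int>>& perGroup) {
  PackedIndices packed;
  packed.offsets.reserve(perGroup.size() + 1);
  packed.offsets.push_back(0);
  for (const auto& group : perGroup) {
    packed.flat.insert(packed.flat.end(), group.begin(), group.end());
    packed.offsets.push_back(packed.flat.size());
  }
  return packed;
}
```

For example, `packGroups({{3, 1}, {}, {7, 8, 9}})` yields a flat buffer of five indices and four offsets; an empty group simply gets two equal consecutive offsets.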

Note: the changes in this PR are not integrated into PR #743. The tables below show the improvements on top of PR #770.

Times with 1 stream

Batch	Initial Time (s)	Current Time (s)	Runtime reduction (fraction)	Speedup factor
1	166.392	162.169	0.025379826	1.026040735
2	113.005	109.386	0.032025132	1.033084673
4	69.5436	66.8378	0.038907966	1.04048308
8	40.7636	39.2434	0.037293075	1.038737724
16	23.9823	23.2354	0.031143802	1.032144917
32	14.7954	14.3573	0.029610555	1.030514094
64	9.6	9.01939	0.060480208	1.064373533
128	6.48	6.04882	0.066540123	1.071283325
256	4.65	4.28963	0.077498925	1.084009577

Times with 2 streams

Batch	Initial Time (s)	Current Time (s)	Runtime reduction (fraction)	Speedup factor
1	116.7	110.365	0.05428449	1.057400444
2	78.46	74.248	0.053683406	1.056728801
4	47.83	45.4441	0.049882919	1.052501865
8	28.107	26.7974	0.046593375	1.048870413
16	16.69	15.6832	0.060323547	1.064196082
32	10.24	9.79104	0.04384375	1.045854169
64	6.57	6.22622	0.052325723	1.055214882
128	4.65	4.22329	0.091765591	1.101037343
256	3.47	3.2422	0.065648415	1.070260934
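The last two columns of both tables appear to be derived from the two timings as shown below. The runtime reduction values are fractions, so 0.054 means roughly 5.4%. The symbols $T_{\text{initial}}$ and $T_{\text{current}}$ are introduced here only for illustration, and the worked numbers use the batch 1 row of the 2-stream table.

```latex
\text{runtime reduction} = \frac{T_{\text{initial}} - T_{\text{current}}}{T_{\text{initial}}}
  = \frac{116.7 - 110.365}{116.7} \approx 0.0543

\text{speedup factor} = \frac{T_{\text{initial}}}{T_{\text{current}}}
  = \frac{116.7}{110.365} \approx 1.0574
```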

List of changes:

Adds a kernel to perform a max reduction on the last axis of a tensor for GPU. This cuts down on the kernel launches needed and removes a stream synchronize for every call.
Refactors beam search to batch secondary factors.
Includes some changes from PR #768 (Small optimizations) to reduce index copying.
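For reference, a max reduction over the last axis takes a `[rows x cols]` tensor and returns one maximum per row. The sketch below is a plain CPU reference of that operation, assuming a row-major layout; it is not the GPU kernel added in this PR, and the function name `maxLastAxis` is hypothetical. On the GPU, CUB's block-level primitives such as `cub::BlockReduce` are one common way to implement this kind of reduction; whether the PR uses that particular primitive is not stated here.

```cpp
#include <algorithm>
#include <cstddef>
#include <stdexcept>
#include <vector>

// CPU reference: `data` is a row-major [rows x cols] tensor flattened into a
// vector. The result holds one maximum per row (the last axis is reduced away).
std::vector<float> maxLastAxis(const std::vector<float>& data,
                               std::size_t rows, std::size_t cols) {
  if (cols == 0 || data.size() != rows * cols)
    throw std::invalid_argument("shape does not match data size");
  std::vector<float> out(rows);
  for (std::size_t r = 0; r < rows; ++r) {
    const auto first = data.begin() + static_cast<std::ptrdiff_t>(r * cols);
    out[r] = *std::max_element(first, first + static_cast<std::ptrdiff_t>(cols));
  }
  return out;
}
```

A reference like this is mainly useful for checking a GPU kernel's output on small inputs.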

Added dependencies: cub

How to test

I ran the regression tests and tested manually with a proxy model.

CMake command: cmake .. -DCOMPILE_CPU=on -DCOMPILE_CUDA=on -DUSE_SENTENCEPIECE=on -DUSE_STATIC_LIBS=off -DCOMPILE_SERVER=off -DUSE_FBGEMM=on -DCOMPILE_CUDA_SM35=off -DCOMPILE_CUDA_SM50=off -DCOMPILE_CUDA_SM60=off -DCOMPILE_CUDA_SM70=on -DCOMPILE_CUDA_SM75=off -DCOMPILE_TESTS=on

Ubuntu - 18.04.3 LTS
nvcc - 10.1.243
gcc - 7.5.0

Checklist

I have tested the code manually
I have run regression tests
I have read and followed CONTRIBUTING.md
I have updated CHANGELOG.md


rhenry-nv · 2021-03-11T01:16:39Z

This is broken: it does not properly handle forwarding hypotheses that could not be expanded by certain factor groups.

rhenry-nv added 9 commits December 7, 2020 10:59

Refactors beam search to batch computation of secondary factors. A phase two will be to implement operators to take advantage of the explicit batching

52f1209

Touches up beam search refactor

de252df

Adds changes to make refactor compatible with master

eb34814

Adds cub as a submodule

6a5b6e0

Adds a fast path for max reduction to reduce H2D/D2H communication.

51532bf

Update change log

e3f4ed8

Fix windows compile errors

1e9b3b4

Explicit cast to int for windows build

799c2f9

Fixes compile errors on Windows in beam_search.cpp

fdc278d

rhenry-nv mentioned this pull request Dec 15, 2020

Introduces new operator to get the lemma logits for factored vocabulary models for GPU inference #776

Open

4 tasks

rhenry-nv closed this Mar 11, 2021
